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Abstract 

Making good predictions of a physical system using a computer code requires the inputs 
to be carefully specified. Some of these inputs called control variables have to reproduce 
physical conditions whereas other inputs, called parameters, are specific to the computer 
code and most often uncertain. The goal of statistical calibration consists in estimating 
these parameters with the help of a statistical model which links the code outputs with the 
field measurements. In a Bayesian setting, the posterior distribution of these parameters 
is normally sampled using MCMC methods. However, they are impractical when the code 
runs are high time-consuming. A way to circumvent this issue consists of replacing the com¬ 
puter code with a Gaussian process emulator, then sampling a clreap-to-evaluate posterior 
distribution based on it. Doing so, calibration is subject to an error which strongly depends 
on the numerical design of experiments used to fit the emulator. We aim at reducing this 
error by building a proper sequential design by means of the Expected Improvement crite¬ 
rion. Numerical illustrations in several dimensions assess the efficiency of such sequential 
strategies. 

Keywords: Bayesian calibration, Gaussian process emulator, Expected Improvement crite¬ 
rion. 


1 Introduction 

This work is incorporated within the field of uncertainty quantification in computer experiments. 
A crucial issue in engineering (aerospace, car, nuclear, etc.) concerns the ability of computer 
codes (also called simulators or computer models) to mimic a physical phenomenon of interest 
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as well as possible. In this regard, the field of so-called Verification and Validation (V&V) aims 
at assessing the accuracy of computer predictions for many applications. For instance, the study 
of V&V has become a huge preoccupation in the nuclear industry where numerical simulation 
is more and more used to assess the safety of installations for which physical experiments are 
impractical or economically infeasible. An essential prerequisite of V&V consists in quantifying 
all sources of uncertainty involved in a code output (Roy and Oberkampf. 2011). Our paper 
is focused on the reduction of one of them, called parameter uncertainty, caused by the lack 


of knowledge about the value of parameters which are specific to the computer code (Kennedy 


and O’Hagan, 2001). They can be either non-measurable physical quantities or just tuning 


factors. Calibration comes down to a statistical inference of these parameters after assuming 
a statistical model which makes explicit the relationship between the code outputs and the 
field measurements (Campbell, 2006, Cox et ah, 2001). Another popular framework which 


deals with parameter uncertainty is called History Matching (HM) (Craig et ah. 1997). HM 


aims at detecting regions of parameter space which appear to be incompatible with the field 
measurements. Based on an implausibility measure, this method is well-suited for large systems 
for which the size of inputs makes immediate calibration intractable. In the same way, Sensitivity 
Analysis (SA) aims at detecting which parameters have a negligible effect on the code output 
(Saltelli e t al.[ 2000). Hence, both HM and SA can shrink the input space before carrying out 
calibration. 

Our work focuses on calibration in a Bayesian fashion (Kennedy and O’Hagan, 2001| Bayarri 


et ah. 2007), rather than frequentist methods (Loeppky et ah, 2006; Gramacy et ah, 2014, Wong 


et ah, 2014). This strategy of inference provides an appropriate framework to quantify the 
parameter uncertainty from prior to posterior distribution as new data become available. In 
the literature, Bayesian calibration is performed within a framework where the code predictions 
suffer from a systematic discrepancy for any value of parameters, which reflects the view that the 
mathematical equations underlying the code should be not considered as a perfect modeling of 
the real world ( |Kennedy and Q’Hagan 2001 Higdon et ah 2004). Even if this framework is more 
realistic, some confounding can appear when the parameters and the shape of the discrepancy 


are jointly estimated (Loeppky et ah, 2006; Brynjarsdottir and O’Hagan, 2014). Because this 


issue is not related with the scope of this paper, our presentation is centered on a statistical 
model which does not include the code discrepancy. However, it will be possible to generalize 
our framework provided the shape of the discrepancy is provided by prior expertise. 

We aim at addressing Bayesian calibration when the code runs are time-consuming, which 
is a critical issue frequently arising in the field of computer experiments. Indeed, as simulations 
needs several hours, even several days to run, at that time the MCMC algorithms become 
burdensome. For instance, if each simulation lasts one hour, then 10000 simulations launched 
by an MCMC exploration of the parameter space will require more than a year, making the 
calibration process impractical. A well-known solution to this issue is to replace the code in 
the likelihood expression by a Gaussian process emulator (GPE) which is constructed from a 
training set of simulations run over a set of input locations, the so-called design of experiments. 
This paper propose two new algorithms for building sequential designs aiming at reducing the 
calibration error induced by the uncertainty of the GPE. Although it was already mentioned 
by Kennedy and O’Hagan (2001) as an important research axis, few papers deal with that 
issue. To the best of our knowledge, Kumar (|2008[) has already proposed some empirical criteria 
for sequentially selecting the code runs. Pratola et al. (2013) have also proposed some adaptive 
strategies whereby the Expected Improvement (El) criterion is computed over a likelihood ratio. 
The El criterion is used to optimize black box functions when a limited number of simulations is 


allocated (Jones et al., 1998). Recently, it has been applied to solving a problem of optimization 
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under uncertainty when the code inputs are (x, u) where x denotes a vector of control variables 
and u denotes a vector of random variables (Janusevskis and Le Riche, 2013). In the same spirit, 
we propose new algorithms which consist in applying the El criterion to the sum of squares of 
the residuals between the code outputs and the field measurements when the code inputs are 
(x, r) where r is a vector of parameters. In this way, we aim at reducing the uncertainty due 
to the GPE in regions of high posterior density. 

This paper is divided into five sections. In Section 2, the statistical framework is introduced, 
then the main features of the GPE are recalled. In Section 3, two new algorithms for Bayesian 
calibration based on the Expected Improvement criterion are presented. Their performances are 
illustrated on two academic examples in Section 4. The conclusions of this paper are provided 
in Section 5. 


2 Calibration framework 


Notations and modeling Let r(x) £ I be a physical quantity of interest where x = 
(a; 1 , ■ ■ • ,x d ) 6 X C is a vector of control variables. This kind of variable is measurable 
in the field experiments and characterizes the system. It can include both physical variables 
(temperature, pressure, velocity, etc.) and design variables (height, area, etc.). We suppose that 
a number of field experiments, say n, has been collected. In this paper, the sites of field exper¬ 
iments will refer to as the matrix Xj = [(x^) T , • • • , (xj) T ] E M na r(R) and the corresponding 
measurements will refer to as the vector z f = (z|, • • • , z^) T . Due mainly to the measurements 
errors, Zf is not exactly equal to ?’(Xj) = (r(xj-), • • • , r(x^)). Hence, for 1 < i < n, 


where 


z) = r(xj) + e\ 
e* ~ A/"(0, A"), 


( 1 ) 

( 2 ) 


is modeled as a white noise. The variance A 2 is assumed to be known because in many cases, 
either the precision of the measuring device is known or it can be estimated from replicates. 

Let y T (x) be a deterministic computer code which predicts r(x) where reTc M m is a vector 
of parameters including either factors attached to the field (chemical rate, friction coefficient, 
etc.) or mathematical tuning factors such as a discretization step having no counterpart in 
physics, or perhaps both (Gang et al. 2009). The computer code is seen as a black box function, 
which supposes nothing is known about the connection between the inputs (x, r) and the output 
y T (x). A simulation refers to a code output y T (x). A numerical design of experiments means a 


set of input locations from which the code is run (Koehler and Owen, 1996). Following Kennedy 


and O’Hagan (2001), the computer code should be considered as an imperfect representation of 

r ( x ) = 2 /e (x) + 6(x), (3) 


the phenomenon r. Hence, 


where 6(x) is the code discrepancy and 6 is the optimal value of parameters. Combining ([3]) and 
([!]), the statistical model which links the simulations with the field measurements is written as 


z f — yo(~x- l f) + &( x /) + ■ 


(4) 


The computer code is said to be calibrated when an estimator 6 of 6 has been calculated as 
being the “best-fitting” parameter according to the statistical model Q. Due to potential 
non-identifiability between d and b{x) in this model, the estimation of 6 requires to set some 


3 















prior hypothesis on the shape of 6(x). This issue has been widely studied over the past decade. 

) and many others have modeled 6(x) by a 
proposed to use a more common regression 


In our statistical setting, we suppose that 6(x) is negligible in the sense where it can not be 
distinguished from the noise e. If 6(x) is significant, then the method developed in this paper 
could be still applied if it has been elicited from prior expertise. 


Kennedy and O’Hagan (2001 ), |Higdon et al.l (2004 


Gaussian process, 
model instead. 


Joseph and Melkote (2009) have 


The unbiased model Let us suppose that 6(x) = 0. In other words, for at least one value in 
T denoted by 6 , the computer code is supposed to be a perfect representation of the physical 
phenomenon r, which means that 


36 eT ; Vx e X, r(x) = y e (x). 
Combining ^ and (jTj) leads to the equation 

z) = y e (x}) + e\ 


(5) 


( 6 ) 


We have chosen to conduct a Bayesian inference of 6 because it has been shown better suited than 
the standard MLEp" where flat likelihood may need regularization, especially if the dimension 


of 6 is high (Kumar, 2008). Let yg(Xj) := (yg (xj), • • • , ye(xj)) be the vector of code outputs 
running over the input held data X f. The posterior distribution of 6 given by the Bayes formula 
is a normalized version of the following product: 


U{0\z f ) oc C(z f \0)U{0), 

1 

exp 


oc 


(\/27rA) r 


~2A^ (0) J n(0) ’ 


(7) 


where 


SS(0) = \\z f -y 0 (X f )\\ 2 (8) 

is the sum of squares of the residuals between the simulations yg(X.f ) and the held measurements. 
Throughout this paper, H(6\zf) is referred to as the target posterior distribution. No closed- 
form expression exists for n(0|zy) because y is usually highly non linear with respect to 6. 
In such cases, n(0|z j) needs to be sampled by running an MCMC algorithm which converges 
to n(6»| Z/ ) over a very large number of samples, often several thousand (Robert and Casella 


1998). In our framework, the MCMC algorithms are thus impractical since each sample requires 
a time-consuming simulation. A way to circumvent this issue consists in setting a Gaussian 
Process Emulator (GPE), denoted by Y '(.), as a prior distribution on y{.) where (.) corresponds 
to a pair of inputs (x, r). In the following, we need to employ the notation reT which refers 
to any value of the code parameter whereas r = 6 refers to the optimal value to be estimated. 


Gaussian process emulator The Gaussian process was first developed within the field of 
computer experiments by Sacks et al. (1989). It is the most familiar surrogate model used to 


mimic a costly computer code. From a Bayesian perspective, the Gaussian process, denoted in 
this paper by K(.) 


should be considered as a prior structure on the code (Currin et al. 


1991): 


(9) 


where 

1 Maximum Likelihood Estimation 
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• mp(.) = h(.) T (3 where h(.) = (hi(.),--- ,h p (.)) T is a vector of regression functions and 
(3 £ is a vector of location parameters, 

• cr 2 the variance of the process, 


• C^r(.,.) is the correlation function where ’Pis a vector of hyper-parameters including a 
range parameter and possibly a smoothness parameter. 

The choice of the correlation function C^(.,.) should depend on the prior information about the 
shape of the code output over X x T. In addition, for both practical and theoretical reasons, a 
stationary function is almost always specified (Stein, 1999). Let Djv £ (X x T) N be a numerical 
design of experiments: 

Djv := {(x 1 ,!- 1 ),--- ,(-x n ,t n )}. (10) 

After running the code over D^v, N simulations can be collected: 

. T 


y( Djv) : = 


l ) 'T 1 ) := 2 /ti( x1 )>"- ,y{* N ,'r N ) := v t n(x n )^ . (11) 

Let 'Vpred and v' pre(] be two vectors in X x T. Then, 

• S^(D^r) = C^((x*,r l ), (xP , r^))i<jj<jv is the matrix of correlations between the simula¬ 
tions y( Djv), 


• Y,q,(v pre d, D^v) = ( Cq,(v pre d , (x*, T*))i<i<Ar is the vector of correlations between Y(y pre d) 
and each of y(Djv)- 

By conditioning the Gaussian process © on the training set of simulations y(Djv), the resulting 
process is still a Gaussian process: 

Y n (.) := Y(.)\y(D N ) ~ QV($ (.), •)), (12) 

with the standard expressions for the conditional mean and covariance: 


yp^(v pr ed) =E[Y N (Vpred)] 

— TTlpiXpred) T L\j> ( "V p redi Djy) S>j>(Djv) y(Djv) n^^iXpred) 


(13) 


and 


V H,Av P redy p red) = Cov(Y N (v pred ), Y N (v' pr ed)) 

— O’ ^Ci|> (Xpredi Vp re d) Lj\J> ( "V pr edi Djv) Evj>(Dj\r) S\j> (x pre di D 


))• (U) 


The GPE is given by the conditional process (12) which yields a stochastic prediction of the 


code for any input v pre d of the input space X x T■ In the case where Vp red belongs to D^r, the 
GPE interpolates the simulations y( Djv), that is for 1 < i < N 

${x i ,T i ) = y(x i ,r i ), 


and 


V^((x',r 1 ),(x',T')) = 0 


which is expected for such an emulator of a deterministic computer code. Lastly, the capability 


of a GPE to predict well the code should be checked thanks to some validation criteria (Bas- 


tos and O’Hagan, 2008). For more details about the GPE, refer to Rasmussen and Williams 


asymptotic properties are Stein (11999]) and Bachoc (2014). 


(2006), Santner et al. (2003), Fang et al. (2006). Other more theoretical references dedicated to 
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Calibration using a GPE In Equation ([8]), the simulations y#(Xj) are replaced with a GPE 
constructed from a design of experiments Dyv- Let, 

• m^Djv) = (^(x 1 , t 1 ) t (3, ■ ■ ■ , h(x N , t n ) j /3) T be the mean vector of the Gaussian process 
evaluated in each location of D^r, 

• rrip(X-f, 6) and £^(Xj, 6) be the mean vector and the correlation matrix of the Gaussian 
process, each evaluated in (Xj, 6 ) := ((x|, 6), ■ ■ • , (x^, e)) T , 

• £^(Dat, (Xj, 0)) be the correlation matrix between D^v and (Xy,0). 

Then, we now consider the available data d := (y(Djy), Zf). The joint likelihood of 6 and (/3, 
a 2 , ’E) is given by 


C F (d\d,a 2 ,f3,^) oc 




1 - 1/2 


(jiV+n _ ® X P 2 ( 7 2 L 


(d - (mp(D N ), m /3 (X / ,d))) T 

C^ 1 (d-(mp(D N ),mp(X.f,0))) , (15) 


where 


C* = 


£*(Djv) E*(D JV ,(X / ,fl))' 

£^(D 7V ,(X / ,0)) T E*(X / ,0) + ^I n 


In the previous paragraph about the GPE, the parameters (3 , <r 2 and tL have been assumed 
to be known. If they are not, their estimation should be conducted jointly with 6 based on 


the full likelihood (15) (Higdon et al., 2004). However, inspired from the pioneering work of 


Cox et al. ( 2001| ), a two-step procedure can be conducted instead. This technique, known 
as modularization in Liu et al. (2009), is still used in a recent work dealing with calibration 
(Gramacy et al. 2014). It first consists in estimating the parameters /3, a 1 and \l/ only thanks 


to the simulations y(Djv) by maximizing the marginal density of y(D^v), denoted by : 




|£*(Div)i ~ V2 


(7 


N 


exp 


2a 2 


(y(D n) - mp(D n))^ 


D^(Dat) (y(Djy) - m^(Djv)) 


(16) 


Then, the C M, s maximum likelihood estimates (MLEs) (<t 2 ,/3, tP) of (cr 2 , /3, \P) are plugged into 
the likelihood of z f conditional to the simulations y(D;v), denoted below by C c \ 


£ c (z f \G,y(D N )) oc \V£. a (0) + X 2 I n \- 1/2 


exp 




+ A 2 In)- i (z / -/x^(X / ,0)) \ (17) 


i2r \—1 


N 

v 


where 

and 


:= >/ 2 3,*( x /’ 6 0) T > 

v£. 2 (0)(i,j) = Co v(y JV (x>,0),y JV (x},0)). 
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This modular approach goes hand-in-hand with intuition whereby the estimation of the 
calibration parameter 6 should be separated from the estimation of a 2 , (3 and \F because they 


are of different natures. Kennedy and O’Hagan (2001) also argue it is not a great loss to estimate 
r2 a it/ because the number of field measurements n is usually much 


<7 , (3 and only thanks to i/(Dn 
smaller than N. Liu et al. (2009) has put in light a poorer mixing in the MCMC routine 
based on the full likelihood <mr than based on ©• From now on, the conditional likelihood 
( fl7[ ) is refered to as the approximated likelihood. Let n c denote the approximated posterior 
distribution induced by (|Tt|) . Then, 


n c (6\z f ,y(D N )) (x C u (z f \9 iy (D N ))U(6) 


c, 


(18) 


The approximated posterior (18) and the target posterior ([7]) are different in that yg(X.f) is 
replaced by the mean vector of the GPE /i^(Xj,0) and the conditional covariance matrix 
V*. 2 (0) is added up to A 2 I„. Contrary to the target posterior the approximated posterior 
is cheap to evaluate, enabling to perform an MCMC algorithm in order to estimate 6. 

Remark 1. The first stage of the modular approach neglects the uncertainty of the parameters 
a 2 , /3 and ^ by fixing them to their MLE. However, a possible manner to take in account their 
uncertainty would consist in adopting a Bayesian inference of a 2 , (3 and \F in the same way than 
6. For instance, if a Jeffreys prior distribution is specified on (/3, cr 2 ), then the distribution of the 
GPE follows a Student distribution (Santner et ah, 2003). Yet unfortunately, the conditional 


likelihood © has no closed-form expression anymore, causing an additional issue which is not 
the core of this paper. 

In presence of a code discrepancy 6(x) The calibration setting (|7|) has to be modified to 


Bayarri et al. 

(2007 

). In 

Higdon et al. (2004 

) and 

Bayarri 


et al. (2007), 6(x) is modeled by a zero mean Gaussian process 

b(.)~gv(o,oic* h (.,.)) 

where (.) corresponds here to an input x. In this case, the logarithm of the approximated 
likelihood (p~T|) is now proportional to a weighted sum of squares of the residuals: 


SS(0) = (z f - yeptf)) T (y? + A"In)- 1 (z / - ye(Xf)) 


(19) 


where Vj?(i,j ) = a 2 C& b (xf, xj-). Sometimes, physical context helps us to fixed both a 2 and 


to plausible values, as done in Craig et al 
algorithms that we present in Section |3] wii 


(2001). In such cases, Vj 1 becomes known and the 
be still practicable based on (19) instead of (l8j). 


Main goal of the paper Our work focuses on reducing the distance between the approxi¬ 
mated posterior distribution (18) and the target posterior distribution Q. The Kullback-Leibler 
(KL) divergence shows interesting theoretical properties to measure how far is a probability dis¬ 
tribution from a reference one (Cover and Thomas, 2006). It is written as 


KL(n(0|z / )||n c (0|z / , 2 /(D JV ))) = ^n(0|z / )(iog(n(d|z / ))-iog(n c (d|z / ,y(D^)))d0. ( 20 ) 
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The design of experiments Djy By using results of approximation theory, we can prove the 
proposition below. 

Proposition 1 . Under the following assumptions: 


n(0) has a bounded support T, 

the code output y T (x) is uniformly bounded on T x X, 


the correlation function (kernel) is a classical radial basis function (Schaback, 1995), 
y lies in the associated Reproducing Kernel Hilbert Space, 

the covering distances associated with the sequence of designs (D n)n tends to 0 with 
N —> oo (see Appendix), 


then, we have: 


Jim KL(U(0\z f )\\U C (G\z f ,y(D N ))) =0. 


TV—^oo 


( 21 ) 


Proof. (21) results from the uniform convergence of | log (n(0| Z j)) — log y(D 7 v))| to 0 

over T when N tends to oo (see Appendix). Then, we can exchange limit and integral in (20), 
which completes the proof. □ 

This proposition reflects the fact that the calibration based on the approximated posterior 


(18) is consistent, in other words the larger D n the closer the approximated posterior (18) to 


the target posterior ([7]). However, when N is small according to the dimension of the input 
space X xT, (18) can be significantly different from 0 leading to a large KL divergence ( |20[ ). 
Although an approximate posterior distribution constructed from an accurate GPE is likely to 


yield a small value of the KL-divergence (20), our own experience has shown such behavior is 
not systematic. 


In practical use, D^v is built as a space-filling design on the input space X x T (Pronzato 


and Muller, 2012), and thus not taking into account that x and t have different roles to play in 


calibration. In fact, our interest is not to predict well the computer code over X x T but rather 


to minimize the KL-divergence (20), which can be developed as 


KL(n(0| Z/ )||n c (0| Z/ , 2/ (D A r))= / n(0| Z/ )(c-cv(0))d0 + 2 / n(6>| Z/ ) 

JO ** J0 

(( Z/ - $ (X /5 0)) t (U^ 2 (0) + A 2 I n ) -1 ( Z/ - $(X f , 0)) - SS(0))d6 (22) 


where 


and 


n . 


C=-^logA 2 , 


cv (0) = - A log | V $ a 2 (0) + A 2 ! 


(23) 


(24) 


According to Equation (22), the KL-divergence could be minimized by reducing the uncertainty 
of the GPE over input locations in the subspace Xj x T n C (X x T) n and above all where 
the target posterior distribution n( Z j|0) is high. In Section |3j we propose new algorithms for 
building proper numerical designs Djv designed for this purpose. 


















3 Adaptive designs for calibration 


To identify the global minimum of a costly black box code, denoted by / (to avoid confusion with 
the calibration setting), the most efficient strategies are based on the Expected Improvement 


(El) criterion (Jones et ah, 1998). They consist in identifying sequentially the input locations 


where the code / should be run to be close to the global minimum, which is relevant when only 
a small number of simulations is allocated. Assuming k simulations f(D k ) have already been 
run, the El criterion assesses the expected improvement of a new run (numbered k + 1) in terms 
of getting close to the unknown global minimum of /. Let v k +i be the input where the El value 
is at its highest, meaning that, 

v fc+ i = argmax EI k (v), 

= argmaxE[(m fc - F fc (v))l Ffc(v)<m J , (25) 

V 

where 

• F k (v) is the current GPE which is built from D^, 

• m k = min{/(vi), • • • , /(vfc_i), /(v^.)} is the current value for the minimum. 

If a deterministic emulator were used instead of F k (v) 1 for instance the mean /u(v) of F k (v), 
the El criterion would be just the difference m k — /r(v) if /r(v) < m k and 0 if /r(v) > rn k ■ Given 
F k (v) is stochastic, Equation (25) is written as the expectation of this truncated difference with 
respect to the distribution of F\ The algorithm that consists of running the code at the input 
Vfc_|_i then updating the emulator and starting again is called Efficient Global Optimization 
(EGO) (Jones et al., 1998). The convergence of the EGO algorithm to the global minimum of / 
has been proven with respect to some assumptions about both the smoothness of the code and 
the correlation function of the GPE (Vazquez and Beet, 2010). In current use, the algorithm 


is stopped when the number of allocated simulations is over or when the improvement of m k 
becomes negligible. According to this last criterion, EGO requires less simulations than other 


optimization methods with comparable levels of performance (Ginsbourger 2009) 


El designed for calibration Our contribution now consists in resorting to the El criterion 
for the sum of squares of the residuals function SS(0): 

EI k (9)=E[(m k -SS k (d)) l ssfc(0) < m J G [0 ,m k ], 

where 


(26) 


• m k := min{S'S'(0i), • • • , SS{0k- 1 ), SS(O k )} and SS(-) denotes the sum of squares com¬ 
puted from actual runs of the computer code y, 

• SS k (.) denotes the sum of squares of the residuals where y is replaced with the GPE 
conditional to y(D k ), denoted by 

Y k (.) = Y(.)\y(D k ), 

where the subscript k now refers to the current iteration of the algorithm. SS k (.) is thus a 
random process and its distribution inherits from the current GPE. At step k, nk new simulations 
need to be run to update m k . Hence, the design contains all the simulations yg. (x ! j) for all 
1 < i < n and 1 < j < k (do not confound with the notation in Section 2 where N has referred 
to the number of simulations). Let 6 * be the maximum of (26): 


6* = argmax EI k {6). 
e 
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Justification for minimizing SS(0) As explained at the end of the previous section, running 
the code over input locations in Xy x T where IL(zf\0) is high reduces the distance between 
H c (zf\0, y(Djv)) and H(zf\0). In addition, when no expert knowledge is available on the value 
of 9 (except perhaps a lower and an upper bound), a noninformative prior should be specified 
for 9 (Kass and Wasserman, |1996 Box and Tiao, 1973). In such cases, the higher the target 


posterior distribution H(zf\0), the lower SS(0). The KL-divergence (20) can be then reduced 
faster by running the code over the input locations (x^,0*) EXfXT where SS(9 *) is small. 
Such locations 0* can be identified by maximizing the El criterion. 

Remark 2. In some works, calibration consists in minimizing a gap between the code outputs 
and the available field measurements, such as SS(9) ( |Wong et al. 2014) or a weighted sum of 
squares like (19) (Joseph and Melkote, 2009| ). For our part, the El criterion applied to SS(9 ) is 
just a circuitous way to efficiently build D^v for Bayesian calibration. 


EGO algorithms Algorithm 1 corresponds to an exact EGO algorithm based on Equation 
( J26| ). It aims at identifying the input locations (x^,0*) 6 XjxT which will be added up 
sequentially to an initial design Do for N iterations. Algorithm 2 is a one at time algorithm and 
should be understood as an approximation of Algorithm 1. 


Algorithm 1 


Initialization 

• Build a maximin Latin Hypercube Design (LHD) Do C X x T of size Nq. 

• Run the code over Do- 

• Compute 6\ as the maximum of n c '(0|zj, Do). 

• Di = Do U {(Xj, 0\)}l<i<n- 

• Update the Gaussian process distribution after running the code over {(x).,0i)}i<j< n . 

• Set m\ := SS{9\). 

From k = 1 , repeat the following steps as long as Nq + n X (k + 1 ) < IV. 

Step 1 Find an estimate 9k+i of 0£ +1 = argrnax EI k (9). 

e 

Step 2 D fc+ i = D fc U {( x jf i @k+ 1 )}l<i<n- 

Step 3 Run the code over all new locations {(xj, Ok+i)}i<i<n- 
Step 4 Update the Gaussian process distribution based on y(Dfc +1 ). 

Step 5 Let m^+i ■= min {mo, • • ■ , m^, SS(9k+ 1 )}. 

Because the distribution of the GPE is updated at Step 4, the hyper-parameters /3, a 2 and 


are re-estimated, as done in the seminal EGO work (Jones et ah, 1998). Algorithm 1 could 


be efficiently performed by running the new simulations at Step 3 simultaneously on several 
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computer nodes. We present below the steps of a one-at-a-time algorithm well-suited when the 
computer system has a single processor. 


Algorithm 2 


Initialization is similar to Algorithm 1 except that Di = Do U{(x* fir)}. 

For k = 1, • • • , N — No, repeat the same steps as in Algorithm 1 except that Step 
3 is replaced with Step 3 and Step 5 is replaced with Step 5. 


Step 3 Run the code in (x*,0fc +1 ) where x* = argmaxCrit(xj, dk+i) (see Equations (27) 
and (29) below). x / eX / 


Step 5 m k+ 1 :=min{E[S5 fc (0i)],--- , E[SS fc (0 fc )], E[SS fc (0 fc+1 )]} 


Algorithm 2 should be understood as an approximation of Algorithm 1 where only a single 
simulation y§ h _ ] ( x *) is run at each iteration. In the following, two criteria are proposed to pick 
up x* among Xf. As the current minimum m k in Algorithm 1 can not be computed anymore, 
we have replaced it with the minimum of the expectations E[S'S' fc (0j)] for 1 < i < k + 1 with 
respect to Y k (.). 

The first criterion is done to run the code at the input location (x*, 9 k ) E Xj x T where the 
variance of Y k (.) is at the highest. Hence, 


Crit (x i f ,O k+1 )=Y[Y k (x i f ,d k+1 )}. 


(27) 


As the variance of the Gaussian process decreases on the space Xj- x T, the approximated 
posterior distribution (18) comes close to the target posterior distribution which justifies 
(27). Yet, a better way might perhaps consist in aiming for a reduction of the GPE uncertainty 


at an input location (x*,0fc_|_i) where the code yg(x*) is highly variable over 9 , meaning that 
x* is important for calibration. We thus introduce a second criterion which does a trade-off 
between the calibration goal and (|27l). A normalized version of it is written as 


Crit(xj, 9 k+ i) = 


Vfoe(x»] 


V(y*(x),g fc+1 )) _ 

nax^ Y (Y k (x) , 9 k+1 )) * . max^ Y[y e (xj,)] ’ 


max 
i= 


(28) 


where V[ye(xj:)] is taken with respect to n(0). Since the code is unknown, an approximation of 
(28) can be based on the mean of Y k (.): 


Crit(xY 0 


/) °k+l) 


Y{Y k (^ fl e k+l )) 


V[^(xj,0)] 


max V(Y fe (x i f ,0 fc+1 )) max V[^(xj.,0)] 


(29) 


Remark 3. A typical problem inherent to sequential designs is when two input locations come 
very close, making the covariance matrix numerically singular and thus difficult to invert. This 
issue can arise when both 0 k is too close to a previous iteration 6 k > and x* is almost the same 
at iteration k, then at k' > k. The usual way to circumvent this issue consists in adding a small 
diagonal matrix to the covariance matrix ff2 (9) of the GPE, called nugget effect. 
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Remark 4. The expectation in (26) is taken with respect to the multivariate Gaussian distri¬ 


bution induced by the distribution of the GPE. Another way would have been to emulate the 
function SS(6) as a Gaussian process, forcing n simulations to be run at each iteration. Doing 
so, one at a time algorithms would have been impossible, making this strategy irrelevant for 
large values of n. 


Computation of the criterion By expanding Equation (26), we have : 


EI k (0) = m k 


F[SS k {G) < m k ] - 


E[S5 fc (0)l gsfcw < mfc 

m k 


> 0 , 


implying 


E[SS k (d)l ssk{0) < mk } < m k F[SS k (G) < m k \. 


(30) 


(31) 


Except in the trivial case n = 1, no closed form can be calculated for (30). The expression 


of EI k (0) is proportional to the sum of F[SS k (G) < m k \ which is the probability of sampling 
inside the hypersphere 13(0, y/m k ) from a multivariate Gaussian distribution and a second term 
which is the expectation of the right truncated SS k (G ) with respect to m k . The first term can be 


calculated either as an infinite series in central chi-square distribution (Sheil and Muircheartaigh 


1977) or thanks to an advanced sampling rejection method (Ellis and Maitra 2007). This second 


method should be preferably performed because the second term has to be estimated using 
MCMC. The minimization of (30) can be performed in a greedy fashion where G k +i is taken 


as the value which maximizes EI k (G) over a grid G £ T. However, the computation of EI k {G) 
could be avoided for some candidates of G, as explained hereafter. 


Computation of G k +\ 


1. Compute F[zf — Yg £ [— 7n k ,m k } n ] which is an upper bound of F[SS k (G) < m k \ for each 6 
of G. 

2. Let 6 = argmaxPfzj — Yg £ [— m k , m k ] n ] be a reference value. 

0eG 

3. Compute EI k (G), 

4. Build the sub-grid G = {6 £ T ; EI k {6) < P[zy — Yg £ [— m k ,m k ] n ]} C G, 

5. Compute EI k (G ) for the values of the sub-grid (5, 

6. Let G = argma xEI k (G). 

e&G 

For G £ G \ G, we have EI k (G ) > P[z f — Yg £ [—m k , m k ] n ] implying EI k (G) > EI k (G). Hence, 
there is no need to compute EI k (G). Unfortunately, this algorithm is only relevant when n is 
small because in higher dimensions, the hypercube [— m k ,m k ] n has a much larger volume than 
the hypersphere. 

Such a greedy optimization works very well on our toy examples, especially in small dimen¬ 
sions of T because G can be constructed fine enough (see Section |4|. In higher dimensions, 
the El criterion could be maximized alternatively over several different grids as iterations of the 
EGO algorithm (see Section 4). However, if G is coarser, we can expect our algorithms stay 
efficient in terms of reducing the KL-divergence because they do not aim at converging precisely 
to the global minimum of SS(G), but rather identifying the area of the input space where SS(G) 
is small. 


12 





















4 Simulations 


Figure 1: Left: the function yt(x) = ( 6 x — 2) 2 x sin (tx — 4) on [0,1] for several values of 
t G [5,15]. Red dots are the field measurements (Xf,Zf) generated by Equation (33). Right: the 
target posterior distribution. 




A 2D example Let us assume that the computer code is given by the following function: 

Vt ■ x —» y t (x) = ( 6 x - 2) 2 x sin (tx - 4), (32) 

where x G X = [0,1] and t G T = [5,15]. For 1 < i < n, the field data z f are generated by 

z f = ye(x l f) + (33) 


where 6 = 12 and ~ AA(0,0.3 2 ). Bayesian calibration of the function (32) is done by sampling 
the target posterior distribution ([?]) (see Figure [l]) where the prior distribution n($) is chosen 
as uniform on [5,15]: 

n(0) oc 1 [5,15] • (34) 


Now, Bayesian calibration of (32) is done by sampling the approximated posterior distribu¬ 


tion (18). Two cases are considered: with n = 3 field measurements, then with n = 9 field 


measurements. 


Case 1: X f = (0.1, 0.3, 0.8) The approximated posterior distribution (|18|) is sampled from a 


GPE with a constant mean m /3 = m and a Matern 5/2 correlation function. All the parameters 
m, ct 2 ,T have been estimated under the modular approach based on a maximin LHD of size 
N = 30 constructed with the R library DiceDesign. This calibration procedure is repeated twice 


by sampling (18) from two different maximin LHD. According to the first one, as illustrated in 


Figure [ 2 ] (left), the regions of high posterior density corresponds to those of the target posterior 
distribution but the two modes are reversed in terms of height. According to the second one, 
as illustrated in Figure [ 2 ] (right), the posterior distribution is very erroneous. Such calibrations 
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Figure 2: Sampling of (18) from two different maximin LHD (using the R library MCMCpack). 




where the approximated posterior distribution is not close to the target one are unwanted, 
which justifies constructing some oriented designs for calibration instead of default space filling 
designs such as maximin LHD. In the following, three adaptive strategies building on the EGO 
algorithms from Section [3] are presented: 


1. a first version based on Algorithm 1. At each iteration, the code is run at all three inputs 
(0.1, Ok), (0.3, 6 k), (0.8, 9k) where 8 k is the current parameter maximizing the El criterion. 
An example of a design obtained with this algorithm is displayed in Figure [3] (upper right). 

2. a second version based on Algorithm 2. The code is run at a single input (x*,9k) where 


x * comes from the input ( Xi,9k ) having the highest variance (27) among the three inputs 
(0.1, 9k), (0.3, 8 k), (0.8, 8 k). An example of a design obtained with this algorithm is displayed 
in Figure [3] (bottom left). 

3. a third version based on Algorithm 2. The code is run at a single input (x*. 9k) where x * 


comes from the input ( Xi,9k ) which maximizes the criterion (29) among the three inputs 
(0.1, 9k), (0.3, 9k), (0.8, 9k). An example of a design obtained with this algorithm is displayed 
in Figure [3] (bottom right) 
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Figure 3: Case 1. Upper left: a maximin LHD (N = 30J. Upper right: a sequential design built 
from Version 1 (Nq = 12). Bottom left: a sequential design built from Version 2 (Nq = 12). 
Bottom right: a sequential design built from Version 3 (No = 12). The black dots are the initial 
design. The red stars are the new experiments selected from the El criterion. 




Let us now assess how good is the calibration by computing KL(II(#|z/)||n r (9\zf, y(Djv)) 
according to the kind of design Djv which is used. The proportion of the time that the 95% 
credibility interval of (18) covers the true value is computed as well. The robustness of the 
results is checked by repeating sampling of the approximated posterior distribution (18) many 
times. Here, calibration is performed from 50 data sets ( ~Kf,Zf ) and for each of them, 50 
calibrations derived from 50 different designs have been conducted. In Figure [4j the boxplots of 
the KL divergence and the boxplots of the coverage are each plotted against the kind of design 
(including the maximin LHD and the sequential designs coming from the EGO algorithms). 
Each value of the boxplot is computed as the mean of the criterion over the 50 calibrations 
derived from a particular data set (Xj,zj). A sharp decrease in the KL divergence is noticed 
when the calibration is done with a sequential design. We can see that both versions 2 and 3 have 
the lowest values for this criterion. The reason is that Algorithm 2 can move more quickly inside 
the parameter space T, which is expected when the target posterior distribution is multimodal. 
Results about coverage look the same although some lower values can be seen. In such cases, 
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the coverage is poor because the true value (6 = 12) is not covered by the 95% interval of the 
target posterior distribution. The same feature is thus expected with the approximated posterior 
distribution. 

Figure 4: Case 1. Left: boxplots of the KL divergence computed between the target posterior 
distribution and the approximated posterior distribution (using the R library FNN). Right: ability 
of the 95% credibility interval to cover the true value (6 = 12). 



-I-I-I-I- -I-I-I-1- 

maximin LHD version 1 version 2 version 3 maximin LHD version 1 version 2 version 3 


Figure 5: The target posterior distribution (Case 2) 


Case 2: Xf = (0.1,0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9) The held data z f are still generated by 

z) = yo(xf) + e* for i = l,---,n, (35) 

where 9 = 12 and e* ~ A/"(0,0.3 2 ). As n is larger, the target posterior distribution U(9\zf) 
now has a single narrow mode around the true value. Similar to the first case, the calibration 
results are improved by using adaptive designs (see Figure [7]). As the number of held data is 
larger, the one-at-a-time Versions 2 and 3 outperform Version 1 because they do not need all 
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Figure 6: Case 2. Upper left: a maximin LHD (N = 30J. Upper right: a sequential design built 
from Version 1 (Nq = 12). Bottom left: a sequential design built from Version 2 (No = 12). 
Bottom right: a sequential design built from Version 3 (Nq = 12). The black dots are the initial 
design. The red stars are the new experiments selected from the El criterion. 
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the evaluations corresponding to a 6 around 14 to discard this area, and evaluations remain to 
refine the sampling around 9 = 12 (see Figure [6]). 


A 6D example Inspired from Saltelli et al. (2000), let us now assume that the computer code 
is given by the following function: 


9 t - x 


ff-r(x) Yl 


Z— 1 


| 4 Xi - 2 | + n 

1 + Tj 


(36) 


This highly non-linear function is used within the held of global sensitivity analysis to assess 
the efficiency of new methods (Marrel 2008). For 1 < i < n = 60, the held data z f are still 
generated by 

z f = 9e{x l f) + e*, (37) 
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Figure 7: Case 2. Left: boxplots of the KL divergence computed between the target posterior 
distribution and the approximated posterior distribution (using the R library FNN). Right: ability 
of the 95% credibility interval to cover the true value (9 = 12 ). 


maximin LHD version 1 version 2 version 3 


maximin LHD version 1 version 2 version 3 


where 6 = (0.34,0.34,0.34) and e* ~ A7(0,0.05 2 ). Here, both x and 6 are three-dimensional 
vectors. In the same spirit as before, we aim at reducing the calibration error between the 
unknown target posterior ([7]) and the approximated posterior (18). The prior distribution n(0) 
is chosen as uniform on [0,1] 3 : 

n(0) oc 1 [o,i]3 • (38) 


Let the allocated number of simulations be equal to IV = 200. Similar to the 2D example, 
maximin LHD are compared with the sequential designs. Given the size of field data (n = 60), 
only Version 2 and Version 3 are performed, both starting from an initial design of size Nq = 100 
simulations. As explained in Section [3j a discretization G of the parameter space is required 
for maximizing the El criterion. Given that the dimension of 6 is larger than in the previous 
example, the choice of such a grid is more sensitive. Indeed, if G is too coarse, some promising 
area of the parameter space could be not explored whereas if G is too fine, the computation 
time will be drastically increased. The solution that we suggest to address this problem consists 
in maximizing the El criterion alternatively on several designs, as iterations, so that: 


G = Gi n G 2 ,n- • • n G M 

(39) 

Gi n G 2 n • • • n G M = 0. 

(40) 


In this example, for simplicity we have chosen M = 2 grids and, 

Gi = [0,0.2, 0.4,0.6,0.8,1] x [0,0.2,0.4, 0.6, 0.8,1] x [0, 0.2, 0.4,0.6, 0.8,1] 


and 

G 2 = [0.1,0.3,0.5, 0.7, 0.9] x [0.1,0.3,0.5, 0.7,0.9] x [0.1,0.3,0.5, 0.7,0.9] . 


Thus, for odd iterations the El criterion is maximized over G\ whereas for even iterations the El 
criterion is maximized over G 2 . The calibration results are illustrated in Figure [8} They look the 
same as those of the 2D example, which again support the advantage of using a sequential design. 
Let us see that neither G± nor G 2 covers the unknown true value. One can think that a more 
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exhaustive decomposition in (39) would make G closer to the true value 6 = (0.34,0.34,0.34), 
perhaps making the results even better. Boxplots for coverage rate have larger variance than 
in the previous examples because each of the 95% credibility interval of the three marginal 
posterior distribution needs to cover the true value 0.34. 


Figure 8: Left: boxplots of the KL divergence computed between the target posterior distribution 
and the approximated posterior distribution (using the R library FNN). Right: Coverage rate. 
For both figures, boxplots are made over 50 calibrations. 




5 Conclusions 

This paper deals with new adaptive numerical designs for calibrating time-consuming computer 
codes in a Bayesian setting. For such codes, Bayesian calibration is based on a Gaussian process 
emulator (GPE) which approximates the code output and thus making the MCMC algorithms 
practicable. After choosing a design of experiments, the GPE is estimated by using a modular 
approach which allows us to separate the estimation of GPE parameters from the estimation of 
the code parameters. 

Our contribution consists in taking advantage of the stochastic property of the GPE to build 
sequentially the design of experiments in such a way that the gap between the posterior distri¬ 
bution based on the GPE (the so-called approximated posterior distribution) and the posterior 
distribution based on the code (the so-called target posterior distribution) is the smallest pos¬ 
sible in terms of the KL-divergence. We have shown it is of a great importance to reduce the 
uncertainty of the GPE where the density of the target posterior distribution is large. In an 
objective Bayesian context where no prior expertise is available about the unknown parameters, 
this goal is equivalent to reduce the uncertainty of the GPE where the sum of squares of the 
residuals between the code outputs and the field measurements is small. We have thus proposed 
sequential strategies for building the design of experiments based on the El criterion designed for 
the sum of squares of the residuals. Our simulations on toy functions have shown such designs 
outperform space-filling designs in terms of low value of the KL- divergence. However, simula¬ 
tions have been performed in an unbiased framework where there is no discrepancy between the 
physical system and the computer code. If prior information is available about the shape of the 
code discrepancy, our algorithms could be applied to a weighted sum of squares function in a 
similar way. 
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It may appear surprising that the prior distribution is not taken into account in the El 
maximization. That is because the prior distribution is identical both for the approximated 
posterior distribution and the target posterior distribution. Furthermore, if we aim at conducting 
a sensitivity analysis to the choice of the prior distribution, then we would like to be able to 
perform several calibrations from different prior distributions without having to rebuild a GPE 
each time. 

The main difference between the algorithms presented in this paper and the current use of 
the EGO algorithm is that the objective function, that is the sum of squares of the residuals, 
is not modeled by a Gaussian process. Such modeling makes possible to conduct one-at-a-time 
sequential strategies where a single simulation y^ fc (x*) is run at each iteration. Two criteria 
have been suggested to pick up x* among the input field measurements but more sophisticated 
criteria might be more relevant. In addition, we might be also think about an EGO algorithm 
which would be designed according to the number of available computers nodes. 

Another concern is how to maximize the El criterion over the parameter space. Because it has 
no closed form expression, the optimization is performed in a greedy fashion, but free-derivative 


optimization algorithms could be also used (Conn et al. 2009). 


Finally, our EGO algorithms start from a maximin LHD on X x T whereas the posterior 
distribution only depends on the simulations running in the space Xj x T. A better strategy 
could be to build a space filling design on this input space. 
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Appendix : Proof of Proposition [l] 

Proof. We refer to the log-conditional likelihood as £ c (zf\6, y(Djy)) = log(£ c '(zj|0, y(D^))) 
and the target log-likelihood as £(zf\6) = log(£((z/|0)). It is sufficient to prove that 

\£ c (z f \6,y(B N ))-£(z f \6)\ (41) 


is uniformly bounded in 6 and the bound tends to zero with N —> oo. We can decompose (41) 


as 


\£ c (z f \e,y(D N ))-£(z f \0)\ < \£ u (z f \0,y(D N )) - £ u (z f \6,y(D N ))\ + 


oC 


Ct 


\£ c (zf\e,y(D N ))-£(z f \6)\ (42) 


where 


£ c (z f \e,y(D N )) = -^( z / - ^( X /> 0 )) T ( Z / -h^(X/,0)) - ~ log 27 t - n log A 

corresponds to the log-conditional likelihood where the function y is replaced with fip and the 
covariance matrix of the GPE is neglected. The second term is bounded as: 


\£ c (z f \e,y(D N ))-£(z f \6)\ = 


- 2*2 (II z / - $ ( x t 0 )H -II z / - II- 


i / n 

w ( “ $ ( x /> 0 )) ( 2 ~/ - 0«( x /) - $( x /> °)) 


2A 2 


1=1 
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Let us suppose that the minimax distance, say of the designs sequence (Dat)at tends 

to 0, namely 

hu N = max min ||(x',t') - (xi,T<)|| —> 0 (43) 

(x',t')£TxX (x.i,Ti)eD N Af->oo 


Then, the uniform bound is deduced from the point-wise bound given for standard radial basis 
correlation function (Schaback 1995). We can obtain 


M x /) -Mjj(x/,0)| < ||y||c* -Gc+OiDn), 


(44) 


where 112/11is the norm of y in the RKHS associated to and Gc^{-) tends to 0 when Ld, 
tends to 0. 


in (42) is written as, 


nC/ 


oC t 


< 


z / - 

/$(X/,0)II =o. : 

;log(|^ 2 (0) + A 2 I n |) 

; lo § ( 

<I^ 2 (0) + a 2 i„| 

v (A 2 ) n 

; lo S 1 


l A 2 

hog 


V A 2 n 


n . 


+ 1 


where the inequality of arithmetic and geometric means is used for bounding the determinant 
by a function of the trace. Using again results in Schaback (1995j), we obtain ^{9)a < 
Gc^{hr> N ). Therefore, 


n 


P (z f \0, y(Djv)) - £ c (z f \0,y(D N ))\ < CST y ■ ^G c „(h Djv ). 

Since (/id^jv tends to 0 with N, putting the two bounds together proves the uniform conver- 

□ 


gence of (41) to 0. 
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